The SI TEDx-UM speech database: a new Slovenian Spoken Language Resource

Authors

Andrej Zgank

Mirjam Sepesy Maucec

Darinka Verdonik

Abstract

This paper presents a new Slovenian spoken language resource built from TEDx Talks. The speech database contains 242 talks with a total duration of 54 hours. The annotations and transcriptions of the acquired spoken material were generated automatically, by applying acoustic segmentation and automatic speech recognition. The development and evaluation subsets were also transcribed manually, following the guidelines specified for the Slovenian GOS corpus. The manual transcriptions were used to evaluate the quality of the unsupervised transcriptions. The average word error rate for the SI TEDx-UM evaluation subset was 50.7%, with an out-of-vocabulary rate of 24% and a language model perplexity of 390. The unsupervised transcriptions contain 372k tokens, of which 32k are distinct.
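The two simpler figures reported in the abstract, word error rate and out-of-vocabulary rate, can be illustrated with a minimal sketch. This is not the authors' evaluation tooling: the function names `word_error_rate` and `oov_rate` are hypothetical, tokenization is assumed to be plain whitespace splitting, and the out-of-vocabulary rate is assumed to be counted over running tokens rather than distinct word forms, which the abstract does not specify.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.

    Assumption: both transcripts are tokenized by whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def oov_rate(tokens, vocabulary):
    """Share of running tokens that are missing from the vocabulary.

    Assumption: the rate is computed per token, not per distinct word.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


# Toy example (not data from the paper): one substitution and one deletion
# against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
print(oov_rate(["a", "b", "c", "d"], {"a", "b", "c"}))
```

In this toy case the manual transcript plays the role of the reference and the automatic transcript the role of the hypothesis, mirroring how the abstract describes the manual transcriptions being used to assess the unsupervised ones.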


Similar resources

SI-PRON Pronunciation Lexicon: a New Language Resource for Slovenian

We present the efforts involved in designing SI-PRON, a comprehensive machine-readable pronunciation lexicon for Slovenian. It has been built from two sources and contains all the lemmas from the Dictionary of Standard Slovenian (SSKJ), the most frequent inflected word forms found in contemporary Slovenian texts, and a first pass of inflected word forms derived from SSKJ lemmas. The lexicon fil...


Recognition of slovenian speech: within and cross-language experiments on monophones using the speechdat(II)

Though the Slovenian SpeechDat(II) database is the largest spoken language resource for Slovenian ever recorded, it belongs to the smaller speech data collections made available by the European LE2-4001 project (http://www.speechdat.org/). The aim of this paper is to analyze this new Slovenian resource and explore the possibilities of supplementing it with data recorded for other languages. Th...


The Universal Dependencies Treebank of Spoken Slovenian

This paper presents the construction of an open-source dependency treebank of spoken Slovenian, the first syntactically annotated collection of spontaneous speech in Slovenian. The treebank has been manually annotated using the Universal Dependencies annotation scheme, a one-layer syntactic annotation scheme with a high degree of cross-modality, cross-framework and cross-language interoperabili...


SI-PRON: A Pronunciation Lexicon for Slovenian


Development of a bilingual spoken dialog system for weather information retrieval

In this paper we present a strategy, current activities and results of a joint project in designing a spoken dialog system for Slovenian and Croatian weather information retrieval. We give a brief description of the system design, of the procedures we have performed in order to obtain domain specific speech databases, monolingual and bilingual speech recognition experiments and WOZ simulation e...



Publication date: 2016
